Nature Machine Intelligence
Springer Science and Business Media LLC
Preprints posted in the last 90 days, ranked by how well they match Nature Machine Intelligence's content profile, based on 61 papers previously published here. The average preprint has a 0.14% match score for this journal, so anything above that is already an above-average fit.
Singh, A.; Yadav, D.; Ali, A.; Jack, J.
The TCR-peptide binding landscape evolves continuously: novel pathogens (SARS-CoV-2, emerging influenza variants) and newly characterized tumor neoantigens introduce epitope families with no training precedent. Deploying a static model trained on historical data leads to degraded performance on emerging epitopes, while naive fine-tuning on new data causes catastrophic forgetting, erasing performance on previously learned epitopes. We introduce ContinualTCR, a continual learning framework that combines reservoir replay with Elastic Weight Consolidation (EWC) regularization to balance stability (retaining old-epitope performance) and plasticity (adapting to new epitopes). Evaluated on a temporally partitioned VDJdb-IEDB benchmark across four sequential epitope arrival tasks, ContinualTCR achieves new-epitope AUROC 0.812 and old-epitope AUROC 0.781 simultaneously, reducing catastrophic forgetting by 62.9% relative to naive fine-tuning. A streaming evaluation protocol with per-task backward transfer (BWT) reporting reveals that replay alone resolves 30.6% of forgetting, EWC alone resolves 57.3%, and their combination achieves synergistic complementarity. These results establish continual learning as a necessary component of production TCR specificity systems that must adapt to evolving pathogen and neoantigen landscapes without requiring full retraining.
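The abstract names reservoir replay and EWC but gives no implementation details. The sketch below shows one minimal way the two pieces are commonly written, assuming a standard reservoir-sampling buffer (Algorithm R) and the usual diagonal-Fisher EWC penalty. The class, function, and parameter names are hypothetical and are not taken from ContinualTCR.

```python
import random


class ReservoirBuffer:
    """Fixed-size replay buffer filled by reservoir sampling (Algorithm R).

    After `seen` examples have streamed past, each one has the same
    probability capacity / seen of being in the buffer.
    """

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            j = self.rng.randrange(self.seen)  # uniform over [0, seen)
            if j < self.capacity:
                self.items[j] = example

    def sample(self, k):
        """Draw up to k stored examples to mix into a new-task batch."""
        return self.rng.sample(self.items, min(k, len(self.items)))


def ewc_penalty(params, anchor_params, fisher_diag, strength):
    """EWC regularizer: (strength / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    `anchor_params` are the weights saved after the previous task and
    `fisher_diag` is a diagonal Fisher information estimate (both assumed
    to be flat lists of floats here for simplicity).
    """
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher_diag)
    )


# Toy usage: stream ten examples from a hypothetical first task.
buffer = ReservoirBuffer(capacity=3)
for i in range(10):
    buffer.add(("epitope_task_1", i))

penalty = ewc_penalty([1.0, 2.0], [0.0, 2.0], [4.0, 1.0], strength=2.0)
```

In a training loop, the total loss on a new task would typically be the new-task loss on a batch that mixes fresh and replayed examples, plus `ewc_penalty`. How ContinualTCR weights the two terms is not stated in the abstract.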
Mascart, C.; Tran, K.; Samoilova, K.; Storan, L. T.; Liu, T.; Koulakov, A.
Recent advances in deep learning have enabled prediction of odorant perception from molecular structure, opening new avenues for odor classification. However, most existing models are limited to predicting percepts from fixed vocabularies and fail to capture the full richness of olfactory experience. Progress is further limited by the scarcity of large-scale olfactory datasets and the lack of standardized metrics for evaluating free-form natural-language odor descriptions. To address these challenges, we introduce Odor Description and Inference Evaluation Understudy (ODIEU), a benchmark which includes perceptual descriptions of over 10,000 molecules paired with a model-based metric for evaluating free-form odor text descriptions. The model-based metric uses Sentence-BERT (SBERT) models which are fine-tuned on olfactory descriptions to allow better evaluation of human-generated odor texts. Using the fine-tuned SBERT models, we show that free-form text odor descriptions contain additional perceptual information in their syntactic structure compared to semantic labels. We further introduce CIRANO (Chemical Information Recognition and Annotation Network for Odors), a transformer-based model that generates free-form odor descriptions directly from molecular structure, thus implementing the molecular structure-to-text (S2T) prediction. CIRANO achieves performance comparable to humans. Finally, we generate human-like descriptions from mouse olfactory bulb neural data using an invertible SBERT model, yielding neural-to-text (N2T) predictions highly aligned with human descriptions. Together, CIRANO and ODIEU establish a standardized framework for generating natural language olfactory descriptions and evaluating their alignment with human perception. Code is available at https://github.com/KoulakovLab/ODIEU
Song, S.; Shi, H.; Wu, H.; Liu, D.; Lin, Y.; Mat Isa, N. A.; Zou, Q.; Wei, L.
Accurate prediction of effector proteins secreted by Gram-negative bacteria is important for elucidating bacterial pathogenic mechanisms and developing precise anti-infective strategies. Although existing methods have benefited from the strong sequence feature extraction capacity of pretrained protein language models, reliance on linear sequence information alone often fails to fully capture the three-dimensional conformational signals required for virulence functions. Meanwhile, conventional structure-based methods are limited by the scarcity of experimentally resolved protein structures. To address these challenges, we propose GeoEPred, a multimodal deep learning framework designed for the synergistic modeling of protein sequence and structure to identify Gram-negative bacterial effector proteins. Specifically, the model integrates sequence-contextual embeddings from a pretrained protein language model with three-dimensional structural representations predicted by ESMFold. A feature projection network refines fine-grained sequence signals associated with effector functions, while geometric vector perceptrons characterize inter-residue orientations, distances, and local spatial topology to capture potential structural conformational motifs. To further enable effective cross-modal fusion, we design a cross-modal alignment and feature-tokenized self-attention module. This module enhances consistency between the sequence-semantic and structural-geometric spaces through contrastive learning and models associations between linear functional motifs and spatial conformational patterns at a fine-grained token level. Extensive evaluations on multiple benchmark datasets show that GeoEPred achieves better predictive performance than existing leading models in T3SE, T4SE, and T6SE prediction tasks, while maintaining stable performance in remote homolog recognition scenarios.
Moreover, the modular and extensible architecture of GeoEPred demonstrates strong generalization ability and substantial application potential for genome-scale effector protein discovery. Author summary: Secreted effector proteins are central virulence factors used by many Gram-negative bacterial pathogens to execute infection strategies. Their functions are governed not only by secretion signals and short linear motifs in the amino acid sequence, but also by three-dimensional folds, local domains, and surface geometric patterns. However, current predictors mainly exploit sequence-contextual features, limiting their ability to model the correspondence between linear sequence signals and spatial conformational motifs, and thereby constraining accuracy and interpretability. Here, we present GeoEPred, a multimodal deep learning framework for secreted effector protein identification. GeoEPred couples sequence-semantic embeddings from a pretrained protein language model with structural representations learned by geometric vector perceptrons. A cross-modal alignment and interaction module uses contrastive learning to improve functional consistency between sequence and structure modalities, while feature-token attention captures fine-grained links between key linear and conformational motifs. Across benchmark datasets covering multiple effector types, GeoEPred outperforms existing state-of-the-art methods and provides interpretable evidence from sequence fragments, structural regions, and cross-modal associations, supporting functional annotation, pathogenic mechanism analysis, and experimental validation.
Bian, S.; Qiao, H.; Yan, T.; Xia, Z.; Gao, X.; Xu, Y.; Shen, R.; Ma, T.; Guan, Z.; Wang, Y. X.; Wong, T. Y.; Dai, Q.
Foundation models (FMs) are powerful tools for enabling the broad clinical application of artificial intelligence (AI) in healthcare systems, offering adaptability to different diseases, modalities, and clinical settings. However, FMs require large-scale datasets to train and fine-tune, while most real-world data are localized in siloed healthcare settings with strict data privacy protection, a restriction that poses a fundamental challenge to developing FMs across healthcare institutions. Here, we develop a fully homomorphic collaborative learning framework, named FOCAL, that enables secure FM-driven diagnosis without exposing raw patient information. Unlike traditional federated learning (FL) frameworks that aggregate locally trained models, FOCAL integrates fully homomorphic encryption (FHE) with split training to execute collaborative learning entirely over encrypted data. Specifically, we apply FOCAL to different types of retinal and pathology FMs to demonstrate its clinical performance. When facing gradient inversion attacks, FOCAL reduced the data leakage rate from 90.6% to 0% with accuracy comparable to state-of-the-art FL paradigms, owing to the provable security provided by FHE. Moreover, under the same level of security, FOCAL can boost the macro-average AUROC by nearly 50% (from 0.5202 to 0.9831) when evaluated against fully encrypted FL models. In the multi-institution comparative experiments, FOCAL consistently outperforms all single-institution FMs, improving AUROCs by 9.62% and 14.46% on ocular disease diagnosis and severity classification, respectively. Lastly, external validations on both retinal and pathology FMs further verified the accuracy and security advantages of FOCAL and highlighted its reliable interpretability and generalizability for cross-institution clinical development and implementation of FMs.
FOCAL is a novel method to build a secure data-sharing AI community, facilitating healthcare institutions to benefit from and contribute to next-generation FMs development without compromising patient privacy and data security.
HOU, Z.; Lee, V. H.-F.; Kwong, D. L.-W.; Guan, X.; Liu, Z.; Dai, W.
The advent of artificial intelligence (AI) has brought revolutionary tools for biomedical transcriptomic (RNA-level) research. However, there are persistent constraints, including limited interpretation in terms of biomedical concepts such as functional pathways, small sample sizes, and substantial time and computing power requirements for AI training. To overcome these limitations, we developed RNAGAN (https://github.com/ZhaozhengHou-HKU/RNAGAN-1.0.git), an AI tool with a generative adversarial network (GAN) structure with the objective of enhancing transcriptomic analysis. The network was established based on public human datasets comprising 4.6 million single cells from multiple organs and 5,900 sequenced samples of various cancer types with normal references. A specialized pathway neural layer was embedded to extract activities of predefined pathways from the Human Molecular Signatures Database (MSigDB), or newly learned pathways from single-cell data. The structure of RNAGAN (generator and discriminator) enables four applications after one shared training procedure: 1. single-cell and bulk-level patient stratification or differential diagnosis; 2. analysis of the gene and pathway markers in a selected disease; 3. pseudo data generation when sample size is limited for downstream analysis; 4. vectorization with gene and pathway-level features learned from multiple data sets. RNAGAN contributes to the efficient utilization of limited data for transcriptomic studies.
Garimella Narasimha, S. V.; Brown, N.; Sridhar, S.
Automated glaucoma screening from optical coherence tomography (OCT) faces two persistent challenges: scarcity of expert-labeled data and unreliable model predictions on diagnostically ambiguous cases. We present a two-tier diagnostic pipeline that addresses both. In the first tier, an EfficientNetV2-S classifier trained under a semi-supervised pseudo supervisor framework achieves 0.84 AUC on 150 held-out test patients from the Harvard Glaucoma Detection and Progression dataset, using only 350 labeled training samples out of 700. In the second tier, 124 flagged cases are routed to a multi-agent system built on MedGemma 4B, where three specialist agents deliberate over three rounds before rendering a final diagnosis. On these flagged cases, the agent system achieves 100% sensitivity (detecting all 55 glaucoma cases with zero missed diagnoses) and 89.5% overall accuracy (111/124), compared to the classifier's 73.4% (91/124). Uncertainty analysis confirms that the classifier's output probability reliably separates confident predictions (96.3% accuracy, n = 27) from uncertain ones (74.0%, n = 123), producing a 22-percentage-point gap that serves as a triage signal. The agents fix 32 cases the classifier misclassifies while introducing 12 new errors, yielding a net improvement of 20 cases. These results are from a single training run without variance estimates and should be interpreted as preliminary evidence that uncertainty-gated routing to vision-language model agents can meaningfully improve diagnostic accuracy on the cases where automated classifiers are least reliable.
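The reported counts on the flagged cases can be checked with a few lines of arithmetic. The sketch below reproduces the accuracy figures from the numbers in the abstract and adds a hypothetical confidence gate to illustrate uncertainty-gated routing. The abstract does not state the actual gating rule or threshold, so `needs_second_tier` and its default of 0.9 are assumptions for illustration only.

```python
# Counts reported in the abstract for the 124 flagged cases.
FLAGGED_CASES = 124
CLASSIFIER_CORRECT = 91
AGENT_FIXED = 32        # classifier errors that the agents correct
AGENT_NEW_ERRORS = 12   # cases the classifier got right but the agents get wrong


def accuracy_pct(correct, total):
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100.0 * correct / total, 1)


def needs_second_tier(prob_glaucoma, confidence_threshold=0.9):
    """Hypothetical gate: route a case onward when the classifier is unsure.

    Confidence is taken as the larger of p and 1 - p for a binary output;
    the threshold value is an assumption, not a figure from the abstract.
    """
    confidence = max(prob_glaucoma, 1.0 - prob_glaucoma)
    return confidence < confidence_threshold


net_gain = AGENT_FIXED - AGENT_NEW_ERRORS          # 32 - 12
agent_correct = CLASSIFIER_CORRECT + net_gain      # 91 + 20
classifier_accuracy = accuracy_pct(CLASSIFIER_CORRECT, FLAGGED_CASES)
agent_accuracy = accuracy_pct(agent_correct, FLAGGED_CASES)
```

The net gain of 20 cases accounts exactly for the move from 91 to 111 correct out of 124, which matches the 73.4% and 89.5% accuracies quoted above.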
Dee, W.; Wenteler, A.; Seal, S.; Morris, O.; Slabaugh, G.
Pervasive batch effects are a common issue, especially in recent large-scale Cell Painting datasets, which have been produced to aid AI-enhanced drug discovery efforts. Technical differences arising from experiments carried out in different batches can cause models to fail to generalize to unseen batches, despite good predictive performance "within batch". We propose a biologically grounded test-time adaptation framework, SHOT-CCR, which uses cell-invariant gradient reversal to decouple morphological signal from experimental confounders. Our approach performs 4.5% better than the current RxRx1 benchmark, classifying 1,139 classes of siRNA genetic perturbations with 91.6% accuracy. We deliver consistent results over four distinct cell types and two prominent Cell Painting datasets: RxRx1 and a subset of JUMP-CP. Across 484 classes of CRISPR perturbations in JUMP-CP, our method improves accuracy by 15.7%.
Treloar, N. J.; Ur-Rehman, S.; Yang, J.
Self-supervised pretraining has become central to biological machine learning, yet microbiome data remains comparatively underexplored in terms of both modeling approaches and evaluation frameworks. To address this gap, we present Atlas, a pretraining dataset of over 539,000 microbiome datapoints from the MGnify database. Using Atlas, we train the Waypoint family of microbiome foundation models: a series of GPT-2 style causal language models ranging from 6M to 170M parameters. We also introduce Compass, a curated benchmark of eight predictive tasks spanning biome classification, drug-microbiome interactions, drug degradation, and infant gut development. Using this benchmark, we compare the performance of Waypoint models against classical baselines and the existing MGM foundation model. Our results show that pretraining leads to consistent and significant improvements in downstream task performance, that both dataset scale and tokenization strategy impact model quality, and that pretraining is essential for achieving favorable scaling behavior. Furthermore, pretrained transformer models begin to reliably outperform classical methods once training data exceeds roughly 10,000 examples - a threshold that is attainable for modern microbiome studies. Finally, we demonstrate that the Waypoint models achieve state-of-the-art performance among microbiome foundation models. Overall, our work highlights the importance of large-scale self-supervised pretraining in this domain and establishes Atlas, Compass, and the Waypoint models as valuable resources for the research community in this emerging field.
Jin, W.; Tan, W.; Li, H.; Ji, X.; Li, M.; Zhang, D.; Xu, J.
Codon usage bias is highly species-specific, posing a major challenge for heterologous protein expression. Existing deep learning approaches to codon optimization rely primarily on DNA or protein sequence information and largely neglect constraints imposed by protein structure and folding. Here, we present Protein structure-Informed Species-specific Codon Optimization (PISCO), a Geometric Vector Perceptron (GVP)-based model that integrates protein sequence, three-dimensional protein structure, and host codon usage statistics to generate optimal, host-specific codon sequences. Compared with protein-structure-agnostic models, PISCO improves codon recovery by 6% and substantially increases similarity to natural coding sequences, reducing divergence by at least 42% in Codon Similarity Index (CSI), 50% in Codon Frequency Distribution (CFD), and 14% in Dynamic Time Warping (DTW) metrics. Ablation analyses demonstrate that incorporating protein folding kinetics and host-specific information is critical to these gains. Moreover, by leveraging host codon usage statistics, PISCO generalizes to optimize codon sequences for species absent from the training data. An autoregressive variant of PISCO further enhances concordance with natural codon usage patterns, at the cost of a modest reduction in codon recovery rate. Wet-lab validation confirms that PISCO-optimized sequences significantly enhance protein solubility and functional expression. Together, these results establish protein structure as a key determinant of species-specific codon optimization and provide a transferable framework for structure-aware gene design.
Lu, H.-E.; Koivisto, D.; Lou, Y.; Zeng, Z.; Yu, T.; Wang, J.; Meng, X.; Nowikow, C.; Wilson, R.; Kumbhare, D.; Pu, J.
Deep learning has transformed medical image and video analysis, but it usually requires large, well-annotated datasets. In many clinical domains, especially when testing novel mechanistic hypotheses, such retrospective datasets are hard to obtain since acquiring adequate cohorts is time-intensive, costly, and operationally difficult. This creates a critical translational gap: scientifically compelling early-stage ideas may remain untested because sample sizes are too small to support conventional deep learning pipelines. Developing data-efficient strategies for evaluating new hypotheses within small prospective cohorts is therefore essential to de-risk innovation before large-scale validation. Myofascial Pain Syndrome (MPS) exemplifies this challenge, as quantitative ultrasound imaging biomarkers for MPS remain underexplored. We investigated whether MPS in the upper trapezius can be detected from full B-mode ultrasound videos in a small prospective cohort (11 controls, 13 patients). Videos were automatically preprocessed and resampled using a sliding window strategy to expand training samples (404 clips). We developed a self-supervised Video Diffusion Encoder (VDE) to learn spatiotemporal representations without relying on extensive labeled data, and compared it with transfer-learning-based ResNet, VideoMAE, and SimCLR. Using subject-level stratified four-fold cross-validation, the VDE outperformed transfer learning baselines and achieved performance comparable to SimCLR, with subject-level AUC of 0.79 and accuracy of 0.86, and no significant differences between latent-only and combined trigger point analyses. These results demonstrate that self-supervised diffusion learning can support robust, data-efficient deep learning in small prospective studies, enabling early feasibility testing of innovative ultrasound biomarkers before large-scale clinical trials.
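Two procedural steps in this abstract, sliding-window resampling and subject-level stratified cross-validation, can be sketched as follows. This is a minimal illustration only: the window length, stride, subject identifiers, and round-robin fold assignment are assumptions, since the abstract reports just the cohort size (11 controls, 13 patients), the clip count, and the use of four folds.

```python
def sliding_windows(n_frames, window, stride):
    """Return (start, end) frame index pairs for overlapping clips.

    `end` is exclusive. A video shorter than `window` yields no clips.
    """
    return [(s, s + window) for s in range(0, n_frames - window + 1, stride)]


def subject_stratified_folds(labels_by_subject, n_folds):
    """Assign whole subjects to folds, spreading each class across folds.

    Splitting by subject (not by clip) keeps every clip from one person in
    the same fold, so overlapping clips cannot leak between train and test.
    """
    folds = [[] for _ in range(n_folds)]
    for label in sorted(set(labels_by_subject.values())):
        subjects = sorted(s for s, y in labels_by_subject.items() if y == label)
        for i, subject in enumerate(subjects):
            folds[i % n_folds].append(subject)
    return folds


# Toy cohort matching the reported sizes; the IDs are made up.
cohort = {f"control_{i:02d}": 0 for i in range(11)}
cohort.update({f"patient_{i:02d}": 1 for i in range(13)})

folds = subject_stratified_folds(cohort, n_folds=4)
clips = sliding_windows(n_frames=10, window=4, stride=2)
```

Subject-level metrics such as the reported AUC would then be computed by aggregating clip predictions per subject within each held-out fold. The aggregation rule is not given in the abstract.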
Tian, C.; Wang, J.; Hou, J.; Liu, W.; Luo, Y.; Wang, Y.; Yang, L.; Lin, W.
Olfactory perception arises from distributed activation across hundreds of olfactory receptors (ORs), yet our understanding of this landscape remains constrained by the scarcity of OR affinity measurements. Here, we present Receptor-Anchored Metric Supervision (RAMS), a transfer learning framework using perceptual consistency as weak supervision to predict OR activation spectra. RAMS fine-tunes a pretrained drug-target affinity model by imposing constraints derived from olfactory perception, where similar odorants are encouraged to exhibit similar OR activations. It transfers protein-ligand interaction knowledge learned from large-scale pharmacological data into the olfactory domain and reshapes it toward OR activation prediction. Evaluations against experimental measurements show that RAMS improves the accuracy of receptor-spectrum prediction and yields biologically plausible activation patterns. The predicted spectra show concordance between receptor discriminative capacity and expression level, and highlight the understudied OR52 family as a potential contributor to primary odor recognition. Together, RAMS provides a scalable framework for reconstructing receptor-anchored olfactory representations.
Shen, L.; Chao, L.; Liu, T.; Liu, Q.; Zhou, G.; Wang, H.; Dong, X.; Li, T.; Zhang, X.; Ni, J.
While protein language models typically rely on sequence-only pretraining objectives, this approach often fails to capture structural regularities and demands large computation. To address this, we introduce ProteinSage, a pretraining framework that learns protein representations under explicit structural constraints. ProteinSage incorporates structural signals via structure-guided masking and a causal objective designed to model long-range dependencies. This structure-constrained pretraining endows ProteinSage with highly transferable representations that achieve superior performance across diverse structure-aware and general protein modeling benchmarks, while requiring substantially less computation. To determine whether these gains stem from genuine structural generalization rather than task-specific fitting, we applied ProteinSage to a structure-driven protein discovery task, focusing on proteins with multi-pass transmembrane helical architectures such as distantly related microbial rhodopsins. The model successfully identified six previously unannotated microbial rhodopsin homologs. Together, our work establishes structure-constrained pretraining as an effective pathway toward data-efficient and structurally faithful protein representation learning.
Tran, P. P.; Do, A. T.
In adversarial representation learning for fair prediction, the gradient reversal coefficient (λ) is widely treated as the primary control for sensitive-attribute invariance. We show this assumption is wrong. Using a dual-stream architecture for cross-ancestry polygenic risk score (PRS) prediction, we demonstrate that latent dimensionality (the information bottleneck) accounts for 8-27× more variance in ancestry leakage than adversarial strength. Varying λ across a 20× range changes leakage by only 2.2 percentage points; varying dimensionality across a 16× range changes it by 46.6 pp. At dimension 8 with no adversarial training (λ = 0), ancestry leakage is 32.9% (chance = 20%): the bottleneck alone achieves near-invariance. The adversary architecture (linear vs deep MLP) is equally irrelevant (0.6 pp range). We validate this finding across two unrelated domains, genomic ancestry invariance (6 clinical traits, 1000 Genomes, n = 2,504) and EEG subject invariance (pretrained HFTP + Braindecode dual-domain model, 20 subjects), observing consistent dimensionality dominance (12.7:1 ratio in EEG). For the genomic application, Stream 1 encodes population structure via DCT-II frequency-domain features (136 coefficients); Stream 2 encodes phenotype signal from top PRS SNPs (PCA to 128 dimensions). The architecture works equally well with standard genomic PCA as the ancestry stream (R² = 0.217 vs 0.222), confirming the contribution is architectural, not encoding-specific. African-ancestry PRS reconstruction R² improves on all six traits (e.g., +5.1 pp for coronary artery disease). Linear models achieve higher aggregate R² but fail catastrophically on cross-ancestry transfer (R² = -12.45 for African-ancestry CAD). We emphasize that we predict PRS (a computed score), not disease phenotypes; validation on biobank-scale phenotype data is ongoing.
These results suggest the adversarial fairness community has been over-investing in adversary engineering relative to simple capacity control. Practitioners should select latent dimensionality first to set the information budget for the fairness-accuracy tradeoff, then optionally use adversarial training for marginal refinement.
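For readers unfamiliar with the gradient reversal coefficient, the sketch below shows what such a layer does in general: it passes activations through unchanged on the forward pass and multiplies the incoming gradient by minus lambda on the backward pass. This is a framework-free illustration with hypothetical names, not the authors' code. The final lines compare the two leakage ranges quoted in the abstract; note that this simple ratio of ranges is a different quantity from the 8-27× variance figure the authors report.

```python
class GradientReversal:
    """Identity on the forward pass; scales gradients by -lam on the backward pass.

    Placed between an encoder and an adversary, it makes the encoder ascend
    the adversary's loss while the adversary itself descends it.
    """

    def __init__(self, lam):
        self.lam = lam

    def forward(self, activations):
        return list(activations)

    def backward(self, grad_from_adversary):
        return [-self.lam * g for g in grad_from_adversary]


layer = GradientReversal(lam=0.5)
passed_through = layer.forward([1.0, -2.0])
encoder_grad = layer.backward([2.0, -4.0])

# With lam = 0 the encoder receives no adversarial signal at all, which is
# the "no adversarial training" setting mentioned in the abstract.
no_adversary_grad = GradientReversal(lam=0.0).backward([2.0, -4.0])

# Leakage ranges quoted in the abstract, in percentage points.
LEAKAGE_RANGE_LAMBDA_PP = 2.2
LEAKAGE_RANGE_DIMENSION_PP = 46.6
range_ratio = LEAKAGE_RANGE_DIMENSION_PP / LEAKAGE_RANGE_LAMBDA_PP
```

Under this reading, selecting the latent dimensionality first amounts to fixing the encoder's output width before tuning `lam`, which is the order of operations the authors recommend.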
Peddi, N.; Bijjula, D. R.; Gogte, S.; Kondaparthi, V.
Major Histocompatibility Complex (MHC) molecules are essential to the immune system because they bind and present peptide antigens to T cells, enabling immune recognition and response. The specificity of MHC-peptide interactions is crucial for understanding immune-related diseases, developing personalized immunotherapies, and designing effective vaccines. Current computational methods, while powerful, often rely on a single type of molecular information, usually sequence, and model the interaction between the two molecules only implicitly. To address these limitations, we introduce MHC-Bind, a novel deep learning framework that captures a more comprehensive and biologically relevant view of the binding event. MHCBind's architecture employs a dual-view feature extraction strategy for both the MHC and the peptide. A Graph Attention Network (GAT) learns topological features from predicted residue contact maps, while a parallel 1D Convolutional Neural Network (CNN) captures multi-scale patterns from sequence embeddings. These four distinct feature sets are then integrated in a cross-fusion module that uses an attention mechanism to model interactions between the two molecules. Finally, a multi-layer perceptron (MLP) regression head maps the fused interaction signature to a binding affinity score. In comparative benchmarks against established predictors such as NetMHCpan, MHCFlurry, and MHCnuggets, MHCBind demonstrates superior performance, achieving a significantly lower average prediction error (RMSE: 0.1485) and a higher correlation (PCC: 0.7231) in allele-specific contexts. For pan-allele tasks, it excels at correctly ranking peptides, with a superior Spearman's correlation (SCC: 0.7102), a crucial advantage for practical applications. The framework's design is inherently flexible, excelling in both allele-specific and pan-allele prediction tasks.
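The three metrics quoted above (RMSE, Pearson correlation, Spearman correlation) measure different things: RMSE penalizes absolute error, PCC rewards a linear relationship, and SCC depends only on rank order, which is why it is the relevant one for peptide ranking. The sketch below gives plain-Python definitions under the standard formulas. The toy affinity values are made up for illustration and are unrelated to the reported results.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between measured and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson(x, y):
    """Pearson correlation coefficient (PCC)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation (SCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up normalized affinities for four peptides.
measured = [0.1, 0.4, 0.35, 0.8]
predicted = [0.2, 0.3, 0.5, 0.9]

toy_rmse = rmse(measured, predicted)
toy_pcc = pearson(measured, predicted)
toy_scc = spearman(measured, predicted)
```

In this toy case one pair of peptides is ranked in the wrong order, so the Spearman value drops below 1 even though the absolute errors are small.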
Jia, Y.; Niu, J.; Qie, Z.; Li, Z.; Laine, A. F.; Guo, J.
Accurate classification of brain tumors from MRI is critical for guiding clinical decision-making; however, existing deep learning models are often hindered by limited interpretability and pronounced sensitivity to hyperparameter selection, which constrain their reliability in medical settings. To address these challenges, we propose TumorCLIP, a lightweight and training-efficient vision-language framework that integrates radiology-informed text prototypes with a DenseNet-based visual encoder to support clinically meaningful semantic reasoning, fused via a Tip-Adapter mechanism. TumorCLIP does not aim to introduce a new vision-language model architecture. Instead, its contribution lies in the integration of radiology-informed text prototypes tailored to MRI interpretation, a systematic evaluation of backbone stability across diverse visual architectures, and a lightweight, training-efficient CLIP-based fusion framework designed for medical imaging applications. We first conduct a comprehensive unimodal benchmark across eight representative visual backbones (EfficientNet-B0, MobileNetV3-Large, ResNet50, DenseNet121, ViT, DeiT, Swin Transformer, and MambaOut) using a standardized optimizer and learning-rate grid search, revealing performance swings exceeding 60 percentage points depending on hyperparameter choices. DenseNet121 shows the strongest stability-accuracy trade-off within our evaluated optimizer and learning-rate grid (97.6%). Leveraging this foundation, TumorCLIP fuses image features with frozen CLIP-derived text prototypes, achieving concept-level explainability, robust few-shot adaptation, and enhanced classification of minority tumor classes. On the test set, TumorCLIP attains 98.5% accuracy, including a +1.86 percentage point recall increase for Neurocytoma, suggesting that radiology-informed textual priors can improve semantic alignment and help refine diagnostic decision boundaries within the evaluated setting.
Additional evaluation on an independent external dataset shows that TumorCLIP achieves improved cross-dataset performance under the evaluated distribution shift, relative to the unimodal DenseNet121 baseline. These results demonstrate TumorCLIP as a practical, interpretable, and data-efficient alternative to conventional visual classifiers, providing evidence for radiology-aware vision-language alignment in MRI-based brain tumor classification. All results are reported within the evaluated datasets and training protocols.
Wang, L.
We introduce OmniGene-4, a unified bio-language foundation model built on Gemma-4-26B-A4B (30 layers, 128 experts per layer, top-8 routing). We inject 28,028 biological tokens (DNA and protein BPE, Foldseek 3Di, DSSP labels), continue pretraining on a 32.5 GB DNA / protein / natural-language / structural mixture, and run a five-stage supervised fine-tuning pipeline (v2-v5) on 199,576 instruction-format examples across eight task families. The final v5 adds a dual-head architecture: a generation head plus two per-residue classification heads (3Di, DSSP) trained jointly under a 0.5/0.5 loss split. v5 reaches 99.40% accuracy on BioPAWS standard protein homology, 82.60% on remote homology (500 pairs), and 93.66% on BixBench -- gaining +14.4, +22.6, +6.7 percentage points over the vocabulary-extended Gemma-4-Instruct baseline, and outperforming ESM-2 (650M) by +32.1 pp on the identical remote-homology split. The classification heads reach 78.6% per-residue accuracy on 3Di (chance 5%) and 100% on DSSP (chance 12.5%). MoE router activations further yield a clean CPT/SFT 96%/4% decomposition of cross-task differentiation, providing direct interpretability of where biological specialization is acquired.
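The phrase "128 experts per layer, top-8 routing" refers to sparse mixture-of-experts gating, in which each token is sent to only a few experts. The sketch below shows the generic idea under one common convention (softmax over the selected logits). It is not taken from OmniGene-4 or Gemma, whose exact router normalization is not described in the abstract, and the function name and toy logits are hypothetical.

```python
import math


def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts and normalize their weights.

    Returns a dict mapping expert index to mixing weight. Weights are a
    softmax restricted to the selected experts, which is an assumption:
    some routers instead apply softmax over all experts before selecting.
    """
    selected = sorted(
        range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True
    )[:k]
    peak = max(gate_logits[i] for i in selected)  # subtract for numerical stability
    exps = {i: math.exp(gate_logits[i] - peak) for i in selected}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


# Toy router output for one token: 128 experts, logits rising with index.
toy_logits = [0.01 * i for i in range(128)]
routing = top_k_route(toy_logits, k=8)
```

With top-8 routing over 128 experts, each token activates 8/128 of the experts in a layer. Recording which experts fire for which inputs is the kind of signal that a router-activation analysis, like the one the abstract mentions, would inspect.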
Pandey, S.; Talo, M.; Siderovski, D. P.; Sumien, N.; Bozdag, S.
Identifying new therapeutic uses for existing drugs is a major challenge in biomedicine, especially for complex neurodegenerative conditions such as Alzheimer disease and related dementias (ADRD), where treatment options remain limited and relevant data are often sparse, heterogeneous, and difficult to integrate. Although general-purpose Large Language Model (LLM) embeddings encode rich semantic information, they often lack the task-specific biomedical context needed for inference tasks such as computational drug repurposing. We introduce Contextualizing LLM Embeddings via Attention-based gRaph learning (CLEAR), a multimodal representation-fusion framework that aligns LLM embeddings with the topological structure of a context-specific Knowledge Graph (KG). Across five benchmark datasets, CLEAR achieved state-of-the-art results, improving predictive performance (e.g., F1 score) by up to 30% over prior methods. We further applied CLEAR to identify FDA-approved drugs with potential for repurposing for ADRD, including Parkinson disease-related dementia and Lewy Body dementia. CLEAR learned a biologically coherent embedding space, prioritized leading ADRD drug candidates, and accurately summarized known therapeutic relationships for FDA-approved Alzheimer disease drugs. Overall, CLEAR shows that grounding biomedical LLM embeddings with context-specific KG signals can improve drug repurposing in data-sparse, real-world settings. GitHub: https://github.com/bozdaglab/CLEAR
Colangelo, G.; Marti, M.
The space of possible phenotype profiles over the Human Phenotype Ontology (HPO) is combinatorially vast, whereas the space of candidate disease genes is far smaller. Phenotype-driven diagnosis is therefore highly non-bijective: many distinct symptom profiles can correspond to the same gene, but only a small fraction of the theoretical phenotype space is biologically and clinically plausible. When a structured ontology exists, this constraint can be exploited to generate realistic synthetic cases. We introduce GraPhens, a simulation framework that uses gene-local HPO structure together with two empirically motivated soft priors, over the number of observed phenotypes per case and phenotype specificity, to generate synthetic phenotype-gene pairs that are novel yet clinically plausible. We use these synthetic cases to train GenPhenia, a graph neural network that reasons over patient-specific phenotype subgraphs rather than flat phenotype sets. Despite being trained entirely on synthetic data, GenPhenia generalizes to real, previously unseen clinical cases and outperforms existing phenotype-driven gene-prioritization methods on two real-world datasets. These results show that when patient-level data are scarce but a structured ontology is available, principled simulation can provide effective training data for end-to-end neural diagnosis models.
Wen, Y.; Xiong, J.; Gong, F.; Ma, L.; Wan, L.
Single-cell RNA sequencing combined with lineage tracing technologies provides rich opportunities to study development and tumor evolution, yet existing computational methods struggle to disentangle intrinsic transcriptional states from lineage-driven effects. We introduce DeepTracing, a deep generative framework that integrates disentangled representation learning with lineage-aware Gaussian processes to explicitly separate intrinsic cellular variation from lineage constraints. The model constructs a layered latent space and enforces independence via Total Correlation regularization, producing intrinsic, lineage, and unified embeddings. Across extensive benchmarks, DeepTracing consistently outperforms existing approaches. In TedSim simulations, it achieves superior clustering of cell states and effectively recovers phylogenetic structure, surpassing original expression and scVI. Applied to mouse tumor lineage-tracing data, DeepTracing attains higher ARI/NMI for tumor-type classification than scVI and PORCELAN, accurately separating primary and metastatic tumors and recovering known trajectories such as early lymph-node divergence and liver-to-kidney cross-seeding. In larger datasets, it maintains strong performance while preserving both transcriptomic continuity and lineage fidelity. DeepTracing also reconstructs continuous developmental trajectories in mouse ventral midbrain, isolating temporal effects from intrinsic differentiation. These results establish DeepTracing as a scalable and interpretable framework for analyzing multimodal single-cell data in tumor progression. Code availability: The source code is publicly available at https://github.com/Yuhong-Wen/DeepTracing.
Zhang, J.; Schwartz, M. A.; Mutaher, M.; Olajide, O.; Pritykin, Y.; Ashenberg, O.; Hacohen, N.; Uhler, C.
Perturbations of genes with functional importance in T cells could be used to change the distribution of CD8 T cell states to enhance anti-tumor functions for cancer immunotherapies. We launched a worldwide computational challenge to predict the effects of gene perturbations and to devise objective functions for prioritizing gene perturbations that lead to desired T-cell state distributions. We supported the challenge by generating a single-cell Perturb-seq dataset profiling the effect of knocking out 73 individual expert-defined genes in T cells transferred into a mouse melanoma model. We compared the top algorithms developed by participants and found that performance was primarily determined by the prior data used for gene feature representation, with features derived from perturbational data proving most effective. Experimental validation of the top 61 genes nominated by the algorithms revealed that perturbation of Ndufv2 and Dimt1 reached the defined objective and biased T cell differentiation toward desired states.